formal statement
related work
Deterministic RLDeterministic system is often the starting case in the study of sample-efficient algorithms, where the issue of exploration and exploitation trade-off is more clearly revealed since both the transition kernel and reward function are deterministic. The seminal work [81] proposes a sample-efficient algorithm for Q-learning that works for a family of function classes. Recently, [21] studies the agnostic setting where the optimal Q-function can only be approximated by a function class with approximation error. The algorithm in [21] learns the optimal policy with the number of trajectories linear with the eluder dimension. Consider MDPM where the transition is deterministic. Assume the function class in Definition 3.1 satisfies Assumption 2.1 and Assumption 2.2. For any t (0,1), if d โฆ(log(BW/ฮป))and n d poly(ฮบ,k,ฮป,BW,Bฯ,H,log(d/t)), then with probability at least 1 tAlgorithm 1 returns the optimal policy ฯ .
Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training
Zhou, Xinyuan, Lei, Yi, Zhou, Xiaoyu, Sun, Jingyi, Zhu, Yu, Ye, Zhongyi, Zhang, Weitai, Liu, Quan, Wei, Si, Liu, Cong
Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover achieves state-of-the-art performance among similarly-sized open-source models within the "Whole-Proof Generation" paradigm. It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. We will release both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, in the near future.
REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning
Shen, Ziju, Huang, Naohao, Yang, Fanyi, Wang, Yutong, Gao, Guoxiong, Xu, Tianyi, Jiang, Jiedong, He, Wanyi, Yang, Pu, Sun, Mengzhou, Ju, Haocheng, Wu, Peihao, Dai, Bryan, Dong, Bin
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
Lean Finder: Semantic Search for Mathlib That Understands User Intents
Lu, Jialin, Emond, Kye, Yang, Kaiyu, Chaudhuri, Swarat, Sun, Weiran, Chen, Wuyang
We present Lean Finder, a semantic search engine for Lean and mathlib that understands and aligns with the intents of mathematicians. We further align Lean Finder with mathematicians' preferences using In addition, Lean Finder is compatible with LLM-based theorem provers, bridging retrieval with formal reasoning. Advances in Lean and mathlib (De Moura et al., 2015; Moura & Ullrich, 2021) are turning mathematical discovery into a collaborative and verifiable research workflow. Despite these advances, state-of-the-art LLMs still cannot solve math research problems. Lean's syn tax, gram mar, and tac tics in cur a steep learn ing curve. All experiments and data processing were conducted outside Meta. Figure 1: In the evaluation with user queries, real users preferred Lean Finder in 81.6% of cases, compared with Consider the two queries below. Lean search engines handle (Gao et al., 2024a;b; Ju & Dong, 2025; Asher, 2025): Denote L/K a field extension, x, y in L are algebraic elements over K with the same minimal polynomial. I'm working with algebraic elements over a field extension and I have two elements, say x and y in L. I know x is algebraic over K, and I've shown that y is a root of the minimal polynomial of x. Does this imply that the minimal polynomials of x and y are actually equal? T arget Statement 2: 1 theorem eq_of_root {x y: L} (hx: IsAlgebraic K x) (h_ev: Polynomial.aeval y (minpoly K x) = 0): minpoly K y = minpoly K x):= -- proof omitted for brevity This user latent (motivation, perspective, abstraction) cannot be inferred or encoded by a purely syntactic informalization. Addressing this challenge calls for Lean search engines that can understand a mathematician's intent, not merely We defer a more rigorous analysis in Section 2.2, and ask our core question: Our approach analyzes and clusters public discussions, then synthesizes queries that simulate user intents (Section 3.1).
Conjecturing: An Overlooked Step in Formal Mathematical Reasoning
Sivakumar, Jasivan Alex, Borchert, Philipp, Cardenas, Ronald, Lampouras, Gerasimos
Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion such as an explicit answer, or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled within autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs both as a distinct task and within the autoformalisation pipeline. However, the conjecture should not be assumed to be provided. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformali-sation performance requires treating conjecturing as an independent task, and investigating further how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning. Natural language reasoning with Large Language Models (LLMs) has emerged as a powerful tool for solving complex mathematical problems. Its effectiveness is highlighted by recent breakthroughs, such as AI systems from OpenAI and Google solving five of six problems from the 2025 International Mathematics Olympiad (IMO) using natural language (Metz, 2025). The critical caveat is that these informal solutions require validation by expert mathematicians, a process that is prone to human error and lack scalability (Gou ezel & Shchur, 2019). Proof assistants like Isabelle (Wenzel et al., 2008) and Lean (Moura & Ullrich, 2021) provide a path toward automated verification at scale through formal reasoning.
TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition
Li, Yupei, Borchert, Philipp, Lampouras, Gerasimos
Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.